[ROCm] Remove BF16 workarounds for PaddleOCR-VL on HIP by austin1997 · Pull Request #5096 · PaddlePaddle/PaddleX

austin1997 · 2026-04-18T04:14:14Z

Summary

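PaddlePaddle/Paddle#78711 closes the remaining HIP BF16 operator gaps at the framework level (BF16 kernels are registered for layer_norm / softmax, and conv2d_add[_act]_fuse_pass is no longer registered on HIP wheels). Together with the already-merged PaddlePaddle/Paddle#78587 (BF16 conv kernels), end-to-end BF16 inference for PaddleOCR-VL now runs on AMD GPUs. This PR removes the two kinds of ROCm-specific BF16 workarounds from PaddleX accordingly.

Fixes #5095. This PR can only be merged after both PaddlePaddle/Paddle#78711 and PaddlePaddle/Paddle#78587 have been merged and released; otherwise the BF16 vision subgraph still crashes on older wheels.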

Changes

paddlex/inference/models/doc_vlm/modeling/paddleocr_vl/_paddleocr_vl.py: remove _keep_in_fp32_modules = ["visual", "mlp_AR"] so that the SigLIP vision tower and the mlp_AR projector follow the model dtype instead of being forced to FP32.
paddlex/inference/models/runners/paddle_static/runner.py: remove the 4 if paddle.is_compiled_with_rocm(): config.delete_pass("conv2d_add_act_fuse_pass"); config.delete_pass("conv2d_add_fuse_pass") blocks (lines 406-408, 462-464, 496-498, 505-507).
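
To illustrate what removing _keep_in_fp32_modules changes, here is a minimal sketch of how such a list could pin submodules to FP32 while everything else follows the model dtype. It is not the PaddleX loader: the helper `resolve_param_dtypes`, the top-level-prefix matching rule, and the parameter names are all hypothetical and only serve the illustration.

```python
from typing import Dict, Iterable, Optional


def resolve_param_dtypes(
    param_names: Iterable[str],
    model_dtype: str,
    keep_in_fp32_modules: Optional[Iterable[str]] = None,
) -> Dict[str, str]:
    """Return the dtype each parameter would end up with.

    Hypothetical rule (an assumption, not PaddleX's actual logic): a
    parameter stays in float32 when its top-level module name is listed
    in keep_in_fp32_modules; otherwise it follows model_dtype.
    """
    keep = set(keep_in_fp32_modules or [])
    resolved = {}
    for name in param_names:
        top_level = name.split(".", 1)[0]
        resolved[name] = "float32" if top_level in keep else model_dtype
    return resolved


# Hypothetical parameter names, chosen only to mirror the module names
# mentioned in this PR ("visual", "mlp_AR").
params = [
    "visual.patch_embed.weight",
    "mlp_AR.linear.weight",
    "model.layers.0.weight",
]

# With the workaround: vision tower and projector are pinned to FP32.
before = resolve_param_dtypes(params, "bfloat16", ["visual", "mlp_AR"])
# Without it (this PR): every parameter follows the model dtype.
after = resolve_param_dtypes(params, "bfloat16", None)
```

Under these assumptions, `before` keeps the two vision-side parameters in float32 while `after` maps all three to bfloat16, which is the behavior change this PR intends on ROCm.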

CUDA inference behavior is completely unchanged.
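
For reference, the shape of the removed guard can be sketched as below. This is a stand-alone approximation, not the code from runner.py: `RecordingConfig` is a hypothetical stand-in for the Paddle inference config, and the `compiled_with_rocm` flag replaces the real `paddle.is_compiled_with_rocm()` call so the sketch runs without Paddle installed.

```python
class RecordingConfig:
    """Hypothetical stand-in for a Paddle inference config.

    It only records which passes were deleted, so the guard can be
    exercised without a Paddle installation.
    """

    def __init__(self):
        self.deleted_passes = []

    def delete_pass(self, name):
        self.deleted_passes.append(name)


# The two pass names come from the PR description.
ROCM_CONV_FUSE_PASSES = ("conv2d_add_act_fuse_pass", "conv2d_add_fuse_pass")


def apply_removed_workaround(config, compiled_with_rocm):
    """Approximation of the block this PR deletes.

    In runner.py the condition is paddle.is_compiled_with_rocm(); here it
    is passed in as a plain boolean (an assumption made for testability).
    """
    if compiled_with_rocm:
        for pass_name in ROCM_CONV_FUSE_PASSES:
            config.delete_pass(pass_name)
    return config


rocm_cfg = apply_removed_workaround(RecordingConfig(), compiled_with_rocm=True)
cuda_cfg = apply_removed_workaround(RecordingConfig(), compiled_with_rocm=False)
```

Because the whole block sits under the ROCm check, the CUDA configuration never has a pass deleted, with or without this PR.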

Relationship to existing PRs

The _keep_in_fp32_modules = None part overlaps with fix(doc_vlm): remove ROCm BF16 _keep_in_fp32_modules workaround in PaddleOCR-VL #5077 (fchange, Hackathon). This PR additionally cleans up the 4 delete_pass workarounds in runner.py, which #5077 does not cover, so whichever PR merges first, the other needs only a trivial rebase. If #5077 has already been merged by the time this PR is merged, git automatically skips the deletion of _keep_in_fp32_modules and the remaining runner.py changes still apply.
The issues PaddleOCR-VL: Remove ROCm BF16 _keep_in_fp32_modules workaround #5076 (fchange) and [ROCm] Remove BF16 workaround now that Paddle framework supports HIP BF16 #5081 (Lau-JW) describe the same set of workarounds. For a detailed explanation of the differences, see #5095 ([ROCm] PaddleOCR-VL still has BF16 workarounds on AMD GPUs: _keep_in_fp32_modules and the 4 conv2d_add fusion-pass delete_pass calls should be removed together once the Paddle framework is fixed).

Test plan

AMD MI300X (gfx942) / ROCm 7.2 / Paddle develop (including Paddle#78587, feat(HIP): register bfloat16 kernels for conv2d/conv3d/depthwise_conv2d on ROCm, and Paddle#78711, [ROCm] Adapt HIP BF16: register BF16 layer_norm, bypass MIOpen BF16 softmax, skip cuDNN-only conv2d fusion passes on HIP):
- BF16 inference with paddlex.create_pipeline("PaddleOCR-VL") on test_ocr.png runs end to end.
- The output text is semantically consistent with the FP32-fallback path (416 vs 417 bytes; the only difference is one extra paragraph separator).
- rocprofv3 --kernel-trace --stats: FP32 GEMM calls drop from 18,756 to 1,316, and GPU kernel time drops from 4,415.7 ms to 3,915.5 ms (1.13×).
- For detailed benchmarks, see the BF16 benchmark attached to the PR description of Paddle#78711.
CUDA: not independently verified (no hardware available). However, the comment on _keep_in_fp32_modules explicitly states its purpose as "ROCm stability (MIOpen bf16 conv has bugs)", and after its removal the vision tower still runs in the dtype the model is loaded with. The delete_pass blocks are guarded by is_compiled_with_rocm(), so the CUDA path is unaffected.
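
As a quick sanity check on the rocprofv3 figures above, the quoted ratio can be recomputed from the raw numbers. The variable names are illustrative; the four input values are the ones reported in this test plan.

```python
# Figures reported in the test plan (rocprofv3 --kernel-trace --stats).
fp32_gemm_calls_before = 18_756
fp32_gemm_calls_after = 1_316
kernel_ms_before = 4_415.7
kernel_ms_after = 3_915.5

# Speedup in total GPU kernel time (before / after).
speedup = kernel_ms_before / kernel_ms_after

# Absolute kernel time saved, in milliseconds.
kernel_ms_saved = kernel_ms_before - kernel_ms_after

# Share of FP32 GEMM calls that disappeared, in percent.
gemm_reduction_pct = 100.0 * (1 - fp32_gemm_calls_after / fp32_gemm_calls_before)

print(f"speedup: {speedup:.2f}x")
print(f"kernel time saved: {kernel_ms_saved:.1f} ms")
print(f"FP32 GEMM calls removed: {gemm_reduction_pct:.1f}%")
```

The recomputed speedup rounds to the 1.13× stated above; the remaining FP32 GEMM calls are a small fraction of the original count.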

The PaddleOCR-VL pipeline previously needed two ROCm-specific escape hatches:

* `_keep_in_fp32_modules = ["visual", "mlp_AR"]` on PaddleOCRVLForConditionalGeneration kept the SigLIP vision tower and the multimodal projector in FP32 because BF16 layer_norm and BF16 softmax were not registered for HIP, so running the vision encoder in BF16 crashed.
* Four `paddle.is_compiled_with_rocm()` blocks in `paddlex/inference/models/runners/paddle_static/runner.py` (lines 406-408, 462-464, 496-498, 505-507) called `delete_pass("conv2d_add_act_fuse_pass")` and `delete_pass("conv2d_add_fuse_pass")` because both PIR passes rewrite conv2d+add[+act] into the `fused_conv2d_add_act` op, which only has a cuDNN GPUDNN kernel, so kernel dispatch then failed on ROCm.

These are addressed at the framework level by the upstream Paddle BF16 fix (layer_norm + softmax registration on HIP, plus gating both PIR passes on PADDLE_WITH_CUDA so they no longer run on HIP builds). With that wheel installed, both PaddleX workarounds become unnecessary:

* Drop `_keep_in_fp32_modules` so the vision encoder + multimodal projector run natively in BF16 on ROCm. End-to-end output matches the FP32-fallback path on PaddleOCR-VL-1.5 (validated on MI300X / gfx942 / ROCm 7.2). This overlaps with PaddlePaddle#5077; if PaddlePaddle#5077 lands first, the conflict is trivial.
* Drop all four `delete_pass` blocks under `paddle.is_compiled_with_rocm()`. Once the framework PR lands, the two passes are no longer registered on HIP wheels, so `delete_pass` becomes a no-op there.

Requires the framework BF16 PR to be merged and released; with older Paddle wheels the BF16 visual path will still crash on ROCm.

CUDA behavior is unchanged: both passes remain registered under PADDLE_WITH_CUDA, and the vision encoder simply uses whatever dtype the model is loaded with.

CLAassistant · 2026-04-18T04:14:21Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

paddle-bot · 2026-04-18T04:14:25Z

Thanks for your contribution!

paddle-bot bot added the contributor External developers label Apr 18, 2026

luotao1 added the PaddlePaddle Hackathon label Apr 20, 2026

austin1997 wants to merge 1 commit into PaddlePaddle:develop from austin1997:bf16-rocm-cleanup
